A NOTE ON PROFILE LIKELIHOOD, LEAST FAVOURABLE FAMILIES
AND KULLBACK-LEIBLER DISTANCE


Robert  Tibshirani  and  Larry  Wasserman 


SUMMARY 

We consider several methods for reducing high dimensional models to one
dimensional models for the purpose of simplifying likelihood inferences. The
equivalence between these methods is investigated.

Some  Key  Words  :  nuisance  parameters,  likelihood,  exponential  families. 


1.  INTRODUCTION 

Consider a statistical model $\Gamma$ consisting of a class of densities
$\{f(x \mid \eta)\}$ where $\eta \in \Omega \subset R^k$ is a vector-valued
parameter of dimension greater than one. Often we are interested in a real
valued function $\theta = \theta(\eta)$. Many useful inferential techniques
involve the log-likelihood function defined by

    $$L(\eta) = a + \sum \log f(x_i \mid \eta)$$
[...] dimensional likelihood function is not available for the parameter of
interest $\theta$. The problem of constructing such a likelihood function is
discussed at length by Kalbfleisch and Sprott (1970).

The method that we discuss here for dealing with this problem is to
strategically choose a sub-family of densities from $\Gamma$ indexed by
$\theta$. We then construct a likelihood function based on the new reduced
model. Specifically, let $\Gamma_\theta = \{f(x \mid \theta)\}$ denote the
reduced model. (When convenient, $\Gamma_\theta$ will also refer to the
corresponding curve in the parameter space $\Omega$.) We then take
$L(\theta) = \sum \log f(x_i \mid \theta)$. (Unqualified sums are to be taken
from $i = 1$ to $n$.)

We shall consider several such techniques for choosing $\Gamma_\theta$ and
investigate certain equivalences between them. We note that some of the
methods of model reduction that we discuss were originally proposed for
reasons other than constructing likelihood functions.


2.  DEFINITIONS 


The first reduction technique we consider is used to construct the profile
likelihood (see Kalbfleisch and Sprott, 1970). Let $f(x \mid \theta)$ be the
density which maximizes the probability of the observed data subject to
$\theta(\eta) = \theta$. This defines a family indexed by $\theta$ which we
will denote by $\Gamma_\theta^{PL}$. The resulting log-likelihood function
will be denoted by $L_{PL}(\theta)$. Note that $L_{PL}(\theta)$ passes through
the global maximum of $L(\eta)$.

Another method of defining a one parameter family is what Stein (1956) calls
the "least favourable family" given by

    $$\eta(\tau) = \hat\eta + \tau I_{\hat\eta}^{-1} \nabla\theta(\hat\eta)$$

where $I_{\hat\eta}$ is the observed Fisher information at $\hat\eta$ and
$\nabla\theta(\eta)$ is the vector whose $i$th component is
$\partial\theta/\partial\eta_i$. This traces a straight line through
$\hat\eta$ having direction $I_{\hat\eta}^{-1} \nabla\theta(\hat\eta)$.
Denote this family by $\Gamma_\theta^{S}$ and the corresponding log-likelihood
by $L_S(\theta)$. This family has the property that the observed Fisher
information for $\theta$ is the same as in the full family $f(x \mid \eta)$
(Stein, 1956). Furthermore, any other (linear) sub-family through $\hat\eta$
has greater Fisher information for $\theta$. (Note that Stein used expected
information in his definition. Here we follow Efron (1984) and use the
observed information.)
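The two information properties just stated can be checked with a short calculation. The sketch below assumes that $I_{\hat\eta}$ is positive definite and writes $\nabla\theta$ for $\nabla\theta(\hat\eta)$. The symbols $d$ and $d_S$ are notation introduced here for illustration only: $d$ is the direction of an arbitrary line through $\hat\eta$, scaled so that $\theta$ changes at unit rate along it, and $d_S$ is the Stein direction with the same scaling.

```latex
\begin{align*}
\text{information for } \theta \text{ in the full family:}\quad
  & \left[(\nabla\theta)^t I_{\hat\eta}^{-1} \nabla\theta\right]^{-1}, \\
\text{Stein direction, unit rate in } \theta:\quad
  & d_S = \frac{I_{\hat\eta}^{-1}\nabla\theta}
               {(\nabla\theta)^t I_{\hat\eta}^{-1}\nabla\theta},
  \qquad
  d_S^t\, I_{\hat\eta}\, d_S
    = \left[(\nabla\theta)^t I_{\hat\eta}^{-1}\nabla\theta\right]^{-1}, \\
\text{any } d \text{ with } (\nabla\theta)^t d = 1:\quad
  & 1 = \left((\nabla\theta)^t d\right)^2
      \le \left((\nabla\theta)^t I_{\hat\eta}^{-1}\nabla\theta\right)
          \left(d^t I_{\hat\eta}\, d\right)
  \quad\text{(Cauchy--Schwarz)}, \\
\text{hence}\quad
  & d^t I_{\hat\eta}\, d
    \ge \left[(\nabla\theta)^t I_{\hat\eta}^{-1}\nabla\theta\right]^{-1}.
\end{align*}
```

Equality holds only when $d$ is proportional to $I_{\hat\eta}^{-1}\nabla\theta$, which is the sense in which the straight line above is least favourable among linear sub-families.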


Still another reduction method is employed by Efron (1981, 1984) for the
purpose of constructing confidence intervals in multi-parameter and
non-parametric settings. Let

    $$C_{\theta_0} = \{\eta : \theta(\eta) = \theta_0\},$$

the level surface of constant $\theta$. Efron selects the value of $\eta$ from
$C_{\theta_0}$ such that the Kullback-Leibler distance

    $$K(\hat\eta, \eta) = \int f(x \mid \hat\eta) \log\{f(x \mid \hat\eta)/f(x \mid \eta)\}\, \mu(dx)$$

is minimized, where $\mu$ is a dominating measure for the family $\Gamma$. As
$\theta_0$ varies, this defines a one parameter family. Since Kullback-Leibler
distance is not symmetric, one can create a "forward" or "backward" family
using $K(\hat\eta, \eta)$ or $K(\eta, \hat\eta)$, respectively. The
corresponding families will be denoted by $\Gamma_\theta^{F}$ and
$\Gamma_\theta^{B}$ and their log-likelihoods by $L_F(\theta)$ and
$L_B(\theta)$.


In section 3 we find the directions of the families at $\hat\eta$. Conditions
under which the families are equivalent will be derived. In section 4 we
consider two examples.


3.  LOCAL  EQUIVALENCE  OF  FAMILIES


The directions of the families can be found using the following lemma.

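Lemma: Let $g: R^k \to R$ be three times differentiable with invertible
Hessian $\nabla^2 g$ and global minimum at $\hat\eta \in R^k$. Let
$\theta: R^k \to R$ be continuously differentiable with non-zero gradient on a
neighbourhood of $\hat\eta$. Define a curve $c(t)$ implicitly by

    $$g(c(t)) = \min_{\eta \in C_t} g(\eta)$$

where $C_t = \{\eta : \theta(\eta) = t\}$ and let $c(t_0) = \hat\eta$. Then

    $$D(c(t))\big|_{t_0} = \frac{dc}{dt}\Big|_{t_0} = \lambda_0 (\nabla^2 g(\hat\eta))^{-1} \nabla\theta(\hat\eta)$$

where

    $$\lambda_0 = \left[(\nabla\theta(\hat\eta))^t (\nabla^2 g(\hat\eta))^{-1} (\nabla\theta(\hat\eta))\right]^{-1}.$$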

Proof: First note that $c(t)$ is differentiable by the implicit function
theorem (Spivak, 1965, p. 41). Now, since $c(t)$ is defined as a minimum we
have (using Lagrange multipliers)

    $$\nabla g(c(t)) = \lambda(t) \nabla\theta(c(t)).$$

Differentiation with respect to $t$ gives

    $$(\nabla^2 g) D(c(t)) = \lambda(t) (\nabla^2 \theta) D(c(t)) + \lambda'(t) \nabla\theta$$

where $\nabla^2 h$ is the matrix with $(i,j)$th entry
$\partial^2 h/\partial\eta_i \partial\eta_j$ for a function $h$. Evaluating
this expression at $t_0$ gives the form of $D(c(t))\big|_{t_0}$. (Note that
$\lambda(t) = 0$ at $t = t_0$.) The constant is determined by differentiating
the Lagrange equation with respect to $\lambda$ and then $t$.
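The final step of the proof can be written out as follows. This is a sketch of the algebra rather than the authors' own wording: it reads "differentiating with respect to $\lambda$" as recovering the constraint $\theta(c(t)) = t$, and it identifies the constant $\lambda_0$ of the lemma with $\lambda'(t_0)$, which is an inference from the displayed equations.

```latex
\begin{align*}
\theta(c(t)) = t
  \;&\Longrightarrow\; (\nabla\theta(c(t)))^t\, D(c(t)) = 1, \\
\text{at } t = t_0 \ (\lambda(t_0) = 0):\quad
  D(c(t))\big|_{t_0}
  &= \lambda'(t_0)\,(\nabla^2 g(\hat\eta))^{-1}\,\nabla\theta(\hat\eta), \\
1 &= \lambda'(t_0)\,(\nabla\theta(\hat\eta))^t\,
     (\nabla^2 g(\hat\eta))^{-1}\,\nabla\theta(\hat\eta), \\
\lambda_0 = \lambda'(t_0)
  &= \left[(\nabla\theta(\hat\eta))^t\,(\nabla^2 g(\hat\eta))^{-1}\,
     \nabla\theta(\hat\eta)\right]^{-1}.
\end{align*}
```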


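Now let $D^a_{\hat\eta}$ be the direction vector of a particular family at
$\hat\eta$, where $a$ = PL, S, F or B to indicate the appropriate family. We
have


Theorem: $D^{PL}_{\hat\eta} = D^{S}_{\hat\eta} = \alpha(I_{\hat\eta}) I_{\hat\eta}^{-1} \nabla\theta(\hat\eta)$
and $D^{F}_{\hat\eta} = D^{B}_{\hat\eta} = \alpha(i_{\hat\eta}) i_{\hat\eta}^{-1} \nabla\theta(\hat\eta)$
where

    $$\alpha(A) = \left[(\nabla\theta(\hat\eta))^t A^{-1} (\nabla\theta(\hat\eta))\right]^{-1}.$$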


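Proof: The direction of $\Gamma_\theta^{S}$ is
$(I_{\hat\eta}^{-1} \nabla\theta)\, d\tau/d\theta$, which equals
$\alpha(I_{\hat\eta}) I_{\hat\eta}^{-1} \nabla\theta(\hat\eta)$ since
$d\tau/d\theta = \alpha(I_{\hat\eta})$. The direction of $\Gamma_\theta^{PL}$
follows from the lemma by taking $g$ to be minus the log-likelihood and
assuming the usual regularity conditions. The directions of
$\Gamma_\theta^{F}$ and $\Gamma_\theta^{B}$ are obtained by noting that, to
second order terms,

    $$K(\hat\eta, \eta) = K(\eta, \hat\eta) = \tfrac{1}{2} (\eta - \hat\eta)^t\, i_{\hat\eta}\, (\eta - \hat\eta)$$

where $i_{\hat\eta}$ is the expected Fisher information at $\hat\eta$ (see
Kullback (1959, p. 28)). Applying the lemma yields the result.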

Therefore Stein's family and the profile likelihood family are locally
equivalent, as are the two Kullback families. A sufficient condition for
$i_{\hat\eta} = I_{\hat\eta}$ is that the model be a member of the exponential
family. Hence in this case all four families are locally equivalent. It can
also be shown, using Hoeffding's lemma (Efron, 1978), that in the exponential
family the profile family and the forward Kullback family are globally
equivalent.

Outside of the exponential family, $i_{\hat\eta}$ and $I_{\hat\eta}$ are in
general different; their difference can be expressed as a function of
statistical curvature (Efron, 1975 and Skovgaard, 1985).

The theorem suggests that inferential techniques based on the local
properties of the likelihood function will be similar for all four methods.
In particular, note that the second derivative (at $\hat\eta$) of the
log-likelihoods is $(D^a_{\hat\eta})^t I_{\hat\eta} (D^a_{\hat\eta})$ for
$a$ = PL and $a$ = S and is $(D^a_{\hat\eta})^t i_{\hat\eta} (D^a_{\hat\eta})$
for $a$ = B and F. Hence $L_{PL}(\theta)$ and $L_S(\theta)$ have the same
second derivatives, as do $L_F(\theta)$ and $L_B(\theta)$. Agreement of the
third derivatives can be shown by a similar calculation. One application of
this result would be the approximation of $L_{PL}(\theta)$ by $L_S(\theta)$.
This can provide considerable simplification since $L_S(\theta)$ requires
only the computation of $I_{\hat\eta}$ and $\nabla\theta(\hat\eta)$ while
$L_{PL}(\theta)$ requires a restricted maximization at each value of
$\theta$. This will be difficult if $\theta(\eta)$ is a complicated function
of $\eta$. However, the quality of such an approximation is still an open
question.
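As a rough illustration of this approximation, the sketch below compares $L_{PL}(\theta)$ and $L_S(\theta)$ in a case where both have closed forms: a bivariate normal mean with identity covariance and $\theta = \eta_1/\eta_2$, the setting of Example 1 in section 4. The sample size `n`, the value of `eta_hat` and all function names are assumptions made for the example. Log-likelihoods are measured relative to their maximum, so the additive constant is dropped.

```python
def grad_theta(eta):
    """Gradient of theta(eta) = eta1 / eta2."""
    return (1.0 / eta[1], -eta[0] / eta[1] ** 2)


def profile_loglik(theta, eta_hat, n):
    """L_PL(theta): -n/2 times the squared distance from eta_hat
    to the level set {eta : eta1 = theta * eta2}."""
    return -0.5 * n * (eta_hat[0] - theta * eta_hat[1]) ** 2 / (1.0 + theta ** 2)


def stein_loglik(theta, eta_hat, n):
    """L_S(theta): log-likelihood along eta_hat + tau * I^{-1} grad_theta,
    where the observed information is I = n * identity."""
    g = grad_theta(eta_hat)
    v = (g[0] / n, g[1] / n)  # I^{-1} grad_theta(eta_hat)
    # Solve (eta_hat1 + tau*v1) / (eta_hat2 + tau*v2) = theta for tau.
    tau = (theta * eta_hat[1] - eta_hat[0]) / (v[0] - theta * v[1])
    return -0.5 * n * tau ** 2 * (v[0] ** 2 + v[1] ** 2)


def second_derivative(f, theta, h=1e-3):
    """Central finite-difference estimate of f''(theta)."""
    return (f(theta + h) - 2.0 * f(theta) + f(theta - h)) / h ** 2


eta_hat, n = (1.0, 2.0), 5
theta_hat = eta_hat[0] / eta_hat[1]

# Curvature of each log-likelihood at the maximum likelihood estimate.
d2_pl = second_derivative(lambda t: profile_loglik(t, eta_hat, n), theta_hat)
d2_s = second_derivative(lambda t: stein_loglik(t, eta_hat, n), theta_hat)
print("curvature at theta_hat:", d2_pl, d2_s)

# The two curves away from theta_hat.
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(theta, profile_loglik(theta, eta_hat, n), stein_loglik(theta, eta_hat, n))
```

For these inputs both curvatures work out analytically to $-n\hat\eta_2^2/(1+\hat\theta^2) = -16$, so the two curves agree closely near $\hat\theta$ and drift apart further away. This says nothing about the quality of the approximation in less tractable models.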


4.  EXAMPLES 


Example  1. 


Let $x$ be bivariate normal with mean $\eta$ and covariance equal to the
identity matrix. Suppose the parameter of interest is
$\theta = \eta_1/\eta_2$. Note that $\theta$ is constant over rays through the
origin in $R^2$. It is easy to show that $K(\eta, \hat\eta)$ reduces to 1/2
times the squared Euclidean distance between $\eta$ and $\hat\eta$, so that
the forward and backward Kullback-Leibler families and the profile likelihood
based family all correspond to the circle passing through the origin and
$\hat\eta$ (see Figure 1a). The corresponding likelihood functions are plotted
in Figure 1b.
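The circle can be checked numerically. The sketch below assumes a particular value of `eta_hat` and treats each level set of $\theta$ as the whole line through the origin with direction $(\theta, 1)$; the function names are hypothetical. For each $\theta$ it finds the point of the level set nearest to $\hat\eta$ and verifies that it lies on the circle whose diameter joins the origin to $\hat\eta$.

```python
import math


def closest_point_on_level_set(eta_hat, theta):
    """Point of {eta : eta1 = theta * eta2} nearest to eta_hat
    (orthogonal projection of eta_hat onto the line)."""
    d = (theta, 1.0)  # direction of the level set through the origin
    scale = (eta_hat[0] * d[0] + eta_hat[1] * d[1]) / (d[0] ** 2 + d[1] ** 2)
    return (scale * d[0], scale * d[1])


def distance_to_circle_centre(point, eta_hat):
    """Distance from `point` to eta_hat / 2, the centre of the circle
    whose diameter joins the origin to eta_hat."""
    return math.hypot(point[0] - eta_hat[0] / 2.0, point[1] - eta_hat[1] / 2.0)


eta_hat = (1.0, 2.0)  # assumed value of the m.l.e., for illustration
radius = math.hypot(eta_hat[0], eta_hat[1]) / 2.0

for theta in (-2.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    p = closest_point_on_level_set(eta_hat, theta)
    # The difference printed last should be zero up to rounding error.
    print(theta, p, distance_to_circle_centre(p, eta_hat) - radius)
```

The check works because the projection $p$ satisfies $(p - \hat\eta)^t p = 0$, so the origin, $p$ and $\hat\eta$ form a right angle at $p$.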


Example  2. 

This example is motivated by Efron's (1984) use of the least favourable
family in computing bootstrap confidence intervals. The data
$x_1, x_2, \ldots, x_n$ are fixed and the family of rescaled multinomial
distributions $M(n, \omega)/n$ is considered. The parameter of interest is a
functional $\theta(\omega)$. The natural parameter is $\eta = \log \omega$. A
least favourable family is drawn through the m.l.e.
$\omega^0 = \hat\omega = (1/n, 1/n, \ldots, 1/n)$. We illustrate this in
Figure 2 for $n = 3$ with $\theta(\omega) = \bar{x}_\omega = \sum \omega_i x_i$
and $(x_1, x_2, x_3) = (-1, 0, 1)$. The triangle represents the simplex

    $$S_3 = \{\omega \mid \omega_i \ge 0, \sum \omega_i = 1\}.$$

The least favourable family and backward Kullback-Leibler family agree while
the profile likelihood and forward Kullback-Leibler families coincide. Also
shown are the level curves $C_\theta$.
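The two pairs of families in this example can be computed directly. The sketch below is an illustration under stated assumptions, not the authors' computation. It uses the standard Lagrange-multiplier forms of the two constrained minimizations: exponential tilting, $\omega_i \propto \exp(\tau x_i)$, for the backward distance $K(\omega, \hat\omega)$, and $\omega_i = 1/\{n(1 + \lambda(x_i - \theta))\}$ for the forward distance $K(\hat\omega, \omega)$, which is also the profile likelihood solution. The function names, the bisection solver and its bracket are hypothetical choices, and $\theta$ must lie strictly between the smallest and largest $x_i$.

```python
import math

X = (-1.0, 0.0, 1.0)  # fixed data of Example 2
N = len(X)


def tilted_weights(theta):
    """Backward Kullback-Leibler solution: w_i proportional to exp(tau * x_i),
    with tau found by bisection so that sum(w_i * x_i) = theta."""
    def weights(tau):
        e = [math.exp(tau * x) for x in X]
        total = sum(e)
        return [v / total for v in e]

    lo, hi = -50.0, 50.0  # assumed bracket for tau
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        mean = sum(w * x for w, x in zip(weights(mid), X))
        if mean < theta:  # the weighted mean is increasing in tau
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))


def profile_weights(theta):
    """Forward Kullback-Leibler (profile likelihood) solution:
    w_i = 1 / (N * (1 + lam * (x_i - theta))), with lam found by bisection."""
    d = [x - theta for x in X]
    lo, hi = -1.0 / max(d), -1.0 / min(d)  # open interval keeping all w_i > 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        h = sum(di / (1.0 + mid * di) for di in d)  # decreasing in lam
        if h > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [1.0 / (N * (1.0 + lam * di)) for di in d]


def forward_kl(w):
    """K(w_hat, w) with w_hat = (1/N, ..., 1/N)."""
    return sum((1.0 / N) * math.log((1.0 / N) / wi) for wi in w)


def backward_kl(w):
    """K(w, w_hat) with w_hat = (1/N, ..., 1/N)."""
    return sum(wi * math.log(wi * N) for wi in w)


for theta in (-0.5, 0.0, 0.5):
    print(theta, tilted_weights(theta), profile_weights(theta))
```

Both curves pass through $(1/3, 1/3, 1/3)$ at $\theta = 0$ and separate elsewhere. Evaluating `forward_kl` and `backward_kl` on the two weight vectors at a common $\theta$ shows which family minimizes which distance.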

Twice log-likelihood for least favourable family (dotted line)
and other families (solid line) corresponding to problem
of Figure 1.